Zoltán Konyha, VRVis, konyha@vrvis.at [PRIMARY contact]
Andreas Ammer, VRVis, ammer@vrvis.at
Krešimir Matković, VRVis, matkovic@vrvis.at
Çağatay Turkay, University of Bergen, Cagatay.Turkay@ii.uib.no
Denis Gračanin, Virginia Tech, gracanin@vt.edu
We developed a Python script that preprocesses the data set and computes
derived data for pairs of sequences. There are 10 native and 58 current
sequences in this challenge, so 68*67=4556 ordered pairs can be constructed,
each representing a possible mutation in the evolution of the virus.
Each pair is one item in our derived data set. Properties of each pair include:
· ID1, CATEGORY1: name and category ("native" vs. "current") of the original sequence.
· ID2, CATEGORY2: name and category of the mutated sequence.
· DIFFCOUNT: number of differences between sequences ID1 and ID2.
· DIFFBASES: set of base substitutions that mutate sequence ID1 into ID2.
For "current" sequences we added disease characteristics, too.
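The derivation of the pair data set can be sketched as follows. This is a minimal illustration, not our actual script: the function names, the toy sequences and the assumption that all sequences are already aligned to the same length are made up for the example. Base substitutions are labelled in the same "position, old base, new base" form used in the figures (e.g. 223AG).

```python
from itertools import permutations


def diff_bases(seq1, seq2):
    """Return the substitutions turning seq1 into seq2 as (position, old, new).

    Positions are 1-based. Both sequences are assumed to be aligned.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(seq1, seq2)) if a != b]


def build_pairs(sequences, categories):
    """Build one record per ordered pair (ID1 -> ID2) of distinct sequences."""
    rows = []
    for id1, id2 in permutations(sequences, 2):
        subs = diff_bases(sequences[id1], sequences[id2])
        rows.append({
            "ID1": id1, "CATEGORY1": categories[id1],
            "ID2": id2, "CATEGORY2": categories[id2],
            "DIFFCOUNT": len(subs),
            "DIFFBASES": {f"{pos}{old}{new}" for pos, old, new in subs},
        })
    return rows


# Toy data, invented for illustration only.
sequences = {"Native_X": "ACGTAC", "101": "ACGTCC", "102": "GCGTCC"}
categories = {"Native_X": "native", "101": "current", "102": "current"}
pairs = build_pairs(sequences, categories)
print(len(pairs))  # n*(n-1) ordered pairs for n sequences
```

With n sequences this yields n*(n-1) records, which is where the 68*67 figure comes from.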
We explored this data set in ComVis, our interactive, multiple linked
views visualization application. ComVis offers several types of views for
scalar, categorical and set type data. Each view is interactive and brushable. Brushes
defined in the same view or in different views can be combined using boolean
operators. The visual analysis context can be captured in session files.
Exchanging session files facilitates better collaboration among our team
members distributed in several cities.
ComVis supports visualization of set-typed data in histograms. The
histogram includes one vertical bar for each possible element of the set-typed
dimension. Each item contributes to the bars of all elements contained in its
set.
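The counting rule behind such a histogram can be sketched in a few lines. This is only an illustration of the idea, not ComVis code; the function name and the toy sets are assumptions.

```python
from collections import Counter


def set_histogram(items):
    """Count, for every element, how many items contain it in their set."""
    counts = Counter()
    for elements in items:
        counts.update(set(elements))  # each item counts once per element
    return counts


# Toy DIFFBASES values of three pairs, invented for illustration.
diffbases = [{"5AC"}, {"1AG", "5AC"}, {"1AG"}]
hist = set_histogram(diffbases)
```

Note that the bar heights do not sum to the number of items, because an item with k elements contributes to k bars.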
In our initial attempts at MC3.3 and MC3.4 we used Jalview, a multiple sequence alignment
editor.
Video:
ANSWERS:
MC3.1:
What is the region or country of origin for the current outbreak? Please provide your answer as the name of the
native viral strain along with a brief explanation.
Nigeria_B
Explanation:
The origin of the current outbreak can be found by identifying
the native sequence that is most similar to the current ones. Figures 1.1 and
1.2 illustrate how the logical AND of three brushes reveals this information.
Figure 1.1: Top left: pairs where the
initial sequence is "native" are brushed (red rectangle). Top,
middle: items where the mutated sequence is "current" are brushed.
Top right: each point in this scatter plot represents one pair of sequences.
ID1 is on the horizontal axis, ID2 is on the vertical axis. Bottom: histogram
of the number of differences. (Click to enlarge.)
Figure 1.2: The red brush in the
bottom histogram narrows the focus to pairs with few differences. The red
highlighted points in the scatter plot indicate that all of the current strains
are similar to Nigeria_B. The brushed pairs are shown in the tabular view at
the bottom, too.
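Outside ComVis, the logical AND of the three brushes corresponds to a simple filter over the pair records. The sketch below uses invented records and an arbitrary threshold of 10 differences for the third brush; both are assumptions made for illustration.

```python
def brush_and(rows, *predicates):
    """Keep the rows that satisfy every predicate (logical AND of brushes)."""
    return [r for r in rows if all(p(r) for p in predicates)]


# Toy pair records, invented for illustration.
rows = [
    {"ID1": "Native_A", "CATEGORY1": "native", "ID2": "c1", "CATEGORY2": "current", "DIFFCOUNT": 40},
    {"ID1": "Native_B", "CATEGORY1": "native", "ID2": "c1", "CATEGORY2": "current", "DIFFCOUNT": 3},
    {"ID1": "Native_A", "CATEGORY1": "native", "ID2": "c2", "CATEGORY2": "current", "DIFFCOUNT": 42},
    {"ID1": "Native_B", "CATEGORY1": "native", "ID2": "c2", "CATEGORY2": "current", "DIFFCOUNT": 5},
    {"ID1": "c1", "CATEGORY1": "current", "ID2": "c2", "CATEGORY2": "current", "DIFFCOUNT": 2},
]

selected = brush_and(
    rows,
    lambda r: r["CATEGORY1"] == "native",   # brush 1: initial sequence is native
    lambda r: r["CATEGORY2"] == "current",  # brush 2: mutated sequence is current
    lambda r: r["DIFFCOUNT"] <= 10,         # brush 3: few differences (assumed cutoff)
)
origins = {r["ID1"] for r in selected}
```

If a single native ID remains in `origins`, that sequence is the candidate origin.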
MC3.2:
Over time, the virus spreads and the diversity of the virus increases as it
mutates. Two patients infected with the
Drafa virus are in the same hospital as Nicolai. Nicolai has a strain identified by sequence
583. One patient has a strain identified
by sequence 123 and the other has a strain identified by sequence 51. Assume only a single viral strain is in each
patient. Which patient likely contracted
the illness from Nicolai and why? Please
provide your answer as the sequence number along with a brief explanation.
123
Explanation:
We assume that the person who contracted the illness
from Nicolai has a strain that is more similar to Nicolai's. Therefore, we need
to find out whether sequence 123 or 51 is more similar to 583. Sequences 583
and 123 differ in one position (see Figure 2.1). Strains 583 and 51 differ in
three positions (see Figure 2.2). The patient with sequence 123 is more likely
to have contracted the illness from Nicolai.
Figure 2.1: Top left: select the pairs
of sequences where ID1 is 583 (Nicolai's sequence). Top right: select pairs
where ID2 is 123. The histogram at the bottom and the tabular view indicate
that the number of differences between those two sequences is 1.
Figure 2.2: Top left: select the pairs
of sequences where ID1 is 583 (Nicolai's sequence). Top right: select pairs
where ID2 is 51. The number of differences between those two sequences is 3.
MC3.3:
Signs and symptoms of the Drafa virus are varied and humans react differently
to infection. Some mutant strains from
the current outbreak have been reported as being worse than others for the
patients that come in contact with them.
Identify
the top 3 mutations that lead to an increase in symptom severity (a disease
characteristic). The mutations involve
one or more base substitutions. For this
question, the biological properties of the underlying amino acid sequence
patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the sequence
(left to right) where the base substitutions occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39)
A → T, 946, and T → C, 842
A → C, 269
A → G, 223
(positions are 1 based)
Explanation:
We look for the three most common mutations that change
symptoms from mild to severe.
Figure 3.1: Selecting mild symptoms
before mutation (top left) and severe symptoms after mutation (top right). The
largest red bars in the bottom histogram indicate the most common base
substitutions found in those pairs: 22GC, 161CG, 223AG, 269AC, 842TC and 946AT.
The highlighted base substitutions in Figure 3.1 are
found in mutations that increase symptom severity, but that does not mean that
all of them cause an increase. We can
brush them one by one and observe the change in symptom severity to find out
which have a decisive effect. This procedure is captured in the video and in
Figure 3.2. We found that 22GC and 161CG are not decisive in increasing symptom
severity, so they have been discarded.
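Brushing one base substitution and observing the change in symptoms can be mimicked by tabulating the before/after ratings of all pairs that contain it. The sketch below is one possible reading of "decisive" (every pair that contains the substitution goes from mild to severe). The field names SYMPTOMS1 and SYMPTOMS2 and the toy records are assumptions for illustration.

```python
from collections import Counter


def symptom_transitions(rows, substitution):
    """Count (before, after) symptom ratings over pairs containing the substitution."""
    return Counter(
        (r["SYMPTOMS1"], r["SYMPTOMS2"])
        for r in rows
        if substitution in r["DIFFBASES"]
    )


def is_decisive(rows, substitution):
    """True if the substitution occurs and all its pairs go from mild to severe."""
    transitions = symptom_transitions(rows, substitution)
    return bool(transitions) and set(transitions) == {("mild", "severe")}


# Toy pair records, invented for illustration.
rows = [
    {"DIFFBASES": {"10AG"}, "SYMPTOMS1": "mild", "SYMPTOMS2": "severe"},
    {"DIFFBASES": {"10AG", "20CT"}, "SYMPTOMS1": "mild", "SYMPTOMS2": "severe"},
    {"DIFFBASES": {"20CT"}, "SYMPTOMS1": "mild", "SYMPTOMS2": "mild"},
]
decisive = {s: is_decisive(rows, s) for s in ("10AG", "20CT")}
```

In this toy example "20CT" is frequent among mild-to-severe pairs only because it co-occurs with "10AG", which is the kind of case the one-by-one brushing is meant to expose.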
Figure 3.2: All mutations that include
223AG change symptoms from mild to severe.
MC3.4:
Due to the rapid spread of the virus and limited resources, medical personnel
would like to focus on treatments and quarantine procedures for the worst of
the mutant strains from the current outbreak, not just symptoms as in the
previous question. To find the most
dangerous viral mutants, experts are monitoring multiple disease
characteristics.
Consider
each virulence and drug resistance characteristic as equally important. Identify the top 3 mutations that lead to the
most dangerous viral strains. The mutations involve one or more base
substitutions. In a worst case scenario,
a very dangerous strain could cause severe symptoms, have high mortality, cause
major complications, exhibit resistance to anti viral drugs, and target high
risk groups. For this question, the
biological properties of the underlying amino acid sequence patterns are not
significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the sequence
(left to right) where the base substitutions occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39).
G → C, 848
T → C, 527
A → C, 269
(positions are 1 based)
Explanation:
Our initial answer to this
question (and the previous one) was based on analysis with Jalview. We
filtered the sequences with a Python script that removes the columns that are
identical in all sequences. The script prints a mapping from the "reduced"
column numbers to the original ones. Strains were sorted by their combined
disease characteristics. The most dangerous viral strains, 118, 123 and 501,
have four out of five disease characteristics rated "most dangerous".
They are shown in the last three lines in Figure 4.1.
Figure 4.1: Viral sequences in
Jalview. Each line represents one sequence. Sequences are sorted by their
combined disease characteristics. The most dangerous ones are shown at the
bottom. The consensus diagram at the bottom indicates the most common bases in
each column. Colors (dark blue to white) indicate how often the given base
occurs at that position. Lighter colors indicate less frequent mutations.
To find base substitutions
that lead to the most dangerous viral strains we need to find bases that appear
often in the last three lines but rarely in the other ones. They are highlighted
by the yellow ovals in Figure 4.1. Unfortunately, the positions indicated by
the small red rectangles (and also displayed in the status bar) refer to the
reduced data set, from which the identical columns have been removed. The
original position has to be looked up in the mapping printed by the script,
which is inconvenient. To explore other consequences of mutations, one has to
change the sorting in the script and start the Jalview session from scratch.
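The column filtering and the position mapping can be sketched as follows. This is an illustrative reconstruction, not the script we used; the function name and the toy sequences are assumptions, and positions are 1-based as elsewhere in this document.

```python
def drop_invariant_columns(sequences):
    """Remove columns that are identical in all (aligned) sequences.

    Returns the reduced sequences and a mapping from reduced column
    numbers to original column numbers (both 1-based).
    """
    length = len(sequences[0])
    keep = [i for i in range(length) if len({s[i] for s in sequences}) > 1]
    reduced = ["".join(s[i] for i in keep) for s in sequences]
    mapping = {new + 1: old + 1 for new, old in enumerate(keep)}
    return reduced, mapping


# Toy sequences, invented for illustration.
seqs = ["ACGTAC", "ACGTCC", "GCGTCC"]
reduced, mapping = drop_invariant_columns(seqs)
for new_col, old_col in mapping.items():
    print(f"reduced column {new_col} -> original column {old_col}")
```

A position read off the reduced alignment is translated back with `mapping[position]`.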
We were not happy with the
flexibility and interactivity of this procedure. We tried a more interactive
solution (already presented in MC3.3), based on computing the base
substitutions that lead from the initial sequence to the mutated one in each
pair. The disease characteristics after mutation are displayed in a parallel
coordinates view. Each axis represents one disease characteristic. The most
dangerous strains can be selected by brushing the top of each axis. We expected
to create five brushes, and then observe the logical AND of those brushes. This
process is captured in Figures 4.2 and 4.3. The selected viral strains are
highlighted in the histogram on the right. Hovering the mouse over a bar shows
its name below the middle of the histogram. Pointing at the red bars reveals
that the most dangerous viral strains are 118, 123 and 501. Four of their five
disease characteristics are rated most dangerous, while complications are only
minor.
We also looked for strains
that cause major complications while being less dangerous in some other
characteristic. One such strain is 211: major complications, high mortality,
resistant to antiviral drugs, but rated only moderate in the remaining two
characteristics. Strains 202 and 705 are also rated most dangerous in three
characteristics and moderate in the other two. We consider those strains less
dangerous than the ones with four top-rated disease characteristics.
Figure 4.2: Top left: mutations that
lead to severe symptoms and high mortality are brushed. There is no red line
going through major complications, which indicates that there are no strains
with severe symptoms, high mortality and major complications. Top right: each
bar represents a sequence ID. Bottom: each bar represents a base substitution.
The red ones are involved in mutations that lead to the selected sequences.
Figure 4.3: All views show the same
data as in Figure 4.2, but strains that are resistant to antiviral drugs and
target high-risk groups are additionally brushed in the parallel coordinates
view.
The histogram at the bottom in Figure 4.3 displays
the base substitutions that lead to those viral strains. Each bar represents a
base substitution. The ones highlighted in red are included in the mutations
that lead to one of the three selected viral strains. Now we need to identify
the base substitutions that appear the most often within this subset.
This task is not completely intuitive. Each bar of
the histogram indicates the number of mutations that include the given base
substitution. The red part of a bar indicates the number of mutations that
include the given base substitution and
lead to one of the most dangerous strains. An entirely red bar indicates that
every mutation that includes the given base substitution leads to one of those strains.
If half of a bar is red, then half of the mutations that include the
given base substitution lead to the selected viral strains. Therefore, we now need
to find the bars in this histogram whose red part is largest relative to the entire bar.
This is a common pattern in analysis tasks and
ComVis includes a "relative" option in its histograms to facilitate
such comparisons. When this option is enabled (see Figure 4.4) all histogram
bars become equally long. The histogram then loses its original
meaning, but it becomes possible to directly compare the percentages of the
brushed subsets.
We can see that two bars, 848GC and 527TC, are
completely red: mutations that include those base substitutions always lead to
very dangerous viral strains. Half of the 269AC bar is also brushed. We
therefore report these three base substitutions as the top three mutations.
Figure 4.4: Bottom: the
"relative" option of the histogram makes it possible to compare the
brushed percentages of bars in the histogram. Compare to Figure 4.3.
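The quantity shown by the "relative" histogram is, for each base substitution, the brushed share of the mutations that include it. A minimal sketch of that computation follows; it is not ComVis code, and the function name and toy records are assumptions.

```python
from collections import Counter


def relative_histogram(rows, is_brushed):
    """Return, per base substitution, the fraction of its pairs that are brushed."""
    total, brushed = Counter(), Counter()
    for r in rows:
        for sub in r["DIFFBASES"]:
            total[sub] += 1
            if is_brushed(r):
                brushed[sub] += 1
    return {sub: brushed[sub] / total[sub] for sub in total}


# Toy pair records, invented for illustration.
rows = [
    {"ID2": "s1", "DIFFBASES": {"7GC", "9AC"}},
    {"ID2": "s2", "DIFFBASES": {"7GC"}},
    {"ID2": "s3", "DIFFBASES": {"9AC"}},
]
dangerous = {"s1", "s2"}  # strains selected by the brushes
shares = relative_histogram(rows, lambda r: r["ID2"] in dangerous)
```

A share of 1.0 corresponds to an entirely red bar, and 0.5 to a bar that is half red.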
We can cross-check those results by performing a
reverse investigation: selecting all mutations that contain those base
substitutions. A part of this procedure is captured in Figure 4.5.
Figure 4.5: Bottom: the base substitution
"A → C, 269" is brushed. Top right: this base substitution is
found in mutations leading to four viral strains: 99, 118, 123 and 997. Two of
those, 118 and 123, are very dangerous. 99 and 997 are less aggressive towards
high-risk groups.
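The reverse investigation amounts to collecting the mutated sequences of all pairs whose DIFFBASES set contains the brushed base substitution. A small sketch with invented records (the function name and the data are assumptions):

```python
def strains_reached(rows, substitution):
    """Return the sorted IDs of mutated sequences reached via the substitution."""
    return sorted({r["ID2"] for r in rows if substitution in r["DIFFBASES"]})


# Toy pair records, invented for illustration.
rows = [
    {"ID2": "s1", "DIFFBASES": {"7GC", "9AC"}},
    {"ID2": "s2", "DIFFBASES": {"7GC"}},
    {"ID2": "s3", "DIFFBASES": {"9AC"}},
    {"ID2": "s1", "DIFFBASES": {"9AC"}},
]
reached = strains_reached(rows, "9AC")
```

Comparing the returned IDs with the set of most dangerous strains shows how exclusively the substitution is tied to them.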